Control Design for Markov Chains under 
Safety Constraints: A Convex Approach 

Eduardo Arvelo and Nuno C. Martins 
Abstract 

This paper focuses on the design of time-invariant memoryless control policies for fully observed controlled 
Markov chains, with a finite state space. Safety constraints are imposed through a pre-selected set of forbidden states. 
A state is qualified as safe if it is not a forbidden state and the probability of it transitioning to a forbidden state is 
zero. The main objective is to obtain control policies whose closed loop generates the maximal set of safe recurrent 
states, which may include multiple recurrent classes. A design method is proposed that relies on a finitely parametrized 
convex program inspired by entropy maximization principles. A numerical example is provided and the adoption of 
additional constraints is discussed. 

I. Introduction 

The formalism of controlled Markov chains is widely used to describe the behavior of systems whose state 
transitions probabilistically among different configurations over time. Control variables act by dictating the state 
transition probabilities, subject to constraints that are specified by the model. Existing work has addressed the 
design of controllers that optimize a wide variety of costs that depend linearly on the parameters that characterize 
the probabilistic behavior of the system. The two most commonly used tools are linear and dynamic programming. 
For an extensive survey, see Q and the references therein. 

We focus on the design of time-invariant memoryless policies for fully observable controlled Markov chains with 
finite state and control spaces, represented as X and U, respectively. Given a pre-selected set F of forbidden states 
of X, a state is qualified as F-safe if it is not in F and the probability of it transitioning to an element of F is zero. 
Here, forbidden states may represent unwanted configurations. We address a control design problem subject to 
safety constraints that consists of finding a control policy that leads to the maximal set of F-safe recurrent states, 
denoted $X_R^F$. This problem is relevant when persistent state visitation is desirable for the largest number of states without 
violating the safety constraint, such as in the context of persistent surveillance. 
We show in Section III that the maximal set of F-safe recurrent states is well defined and achievable by 
suitable control policies. As we discuss in Remark 2.2, $X_R^F$ may contain multiple recurrent classes, but does not 
intersect the set of forbidden states F. 

A. Comparison with existing work 

Safety-constrained controlled Markov chains have been studied in a series of papers by Arapostathis et al., where 
the state probability distribution is restricted to be bounded above and below by safety vectors at all times. In flfQ, 
and 03], the authors propose algorithms to find the set of distributions whose evolution under a given control 
policy respects the safety constraint. In ifTHl . an augmented Markov chain is used to find the maximal set of 
probability distributions whose evolution respects the safety constraint over all admissible non-stationary control 
policies. 

Here we are not concerned with the maximization of a given performance objective, but rather with systematically 
characterizing the maximal set of F-safe recurrent states and its corresponding control policies. The main contribution 
of this paper is to solve this problem via finitely parametrized convex programs. Our approach is rooted in entropy 
maximization principles, and the proposed solution can be easily implemented using standard convex optimization 
tools, such as the ones described in ifTTI . 

B. Paper organization 

The remainder of this paper is organized as follows. Section II provides notation, basic definitions and the problem 
statement. The convex program that generates the maximal set of F-safe recurrent states is presented in Section III, 
along with a numerical example. Further considerations are given in Section IV, while conclusions are discussed 
in Section V. 

II. Preliminaries and Problem Statement 
The following notation is used throughout the paper: 



X           state space of the Markov chain 
U           set of control actions 
n           cardinality of X 
m           cardinality of U 
$X_k$       state of the Markov chain at time k 
$U_k$       control action at time k 
$P_X$       set of all pmfs with support in X 
$P_U$       set of all pmfs with support in U 
$P_{XU}$    set of all joint pmfs with support in X × U 
$S_f$       support of a pmf f 

The recursion of the controlled Markov chain is given by the (conditional) probability mass function of $X_{k+1}$ 
given the previous state $X_k$ and control action $U_k$, and is denoted as: 

$$Q(x^+, x, u) \overset{\mathrm{def}}{=} P(X_{k+1} = x^+ \mid X_k = x, U_k = u), \quad x^+, x \in X, \ u \in U.$$

We denote any memoryless time-invariant control policy by 

$$\mathcal{K}(u, x) \overset{\mathrm{def}}{=} P(U_k = u \mid X_k = x), \quad u \in U, \ x \in X,$$

where $\sum_{u \in U} \mathcal{K}(u, x) = 1$ for all x in X. The set of all such policies is denoted as $\mathbb{K}$. 

Assumption: Throughout the paper, we assume that the controlled Markov chain Q is given. Hence, all 
quantities and sets that depend on the closed loop behavior will be indexed only by the underlying control policy 
K. 

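Given a control policy $\mathcal{K}$, the conditional state transition probability of the closed loop is represented as: 

$$P_{\mathcal{K}}(X_{k+1} = x^+ \mid X_k = x) \overset{\mathrm{def}}{=} \sum_{u \in U} Q(x^+, x, u)\, \mathcal{K}(u, x), \quad x^+, x \in X. \tag{1}$$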
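
Equation (1) is a sum over control actions of the open-loop kernel weighted by the policy. A minimal sketch of that computation follows, assuming a nested-list layout `Q[u][x][x_next]` and `K[u][x]`; the 3-state, 2-action chain and the function name `closed_loop` are hypothetical and not taken from the paper.

```python
# Toy data (hypothetical, not from the paper): 3 states, 2 actions.
# Q[u][x][x_next] = P(X_{k+1} = x_next | X_k = x, U_k = u)
Q = [
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # action 0
    [[0.5, 0.0, 0.5], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],  # action 1
]
# K[u][x] = P(U_k = u | X_k = x); each column sums to 1.
K = [
    [0.5, 1.0, 1.0],
    [0.5, 0.0, 0.0],
]


def closed_loop(Q, K):
    """Return P with P[x][x_next] = sum_u Q(x_next, x, u) * K(u, x), as in (1)."""
    m, n = len(Q), len(Q[0])
    return [
        [sum(Q[u][x][xn] * K[u][x] for u in range(m)) for xn in range(n)]
        for x in range(n)
    ]


P = closed_loop(Q, K)
```

Each row of `P` sums to one whenever each column of `K` does, which is a quick sanity check on the indexing convention.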

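We define the set of recurrent states $X_R^{\mathcal{K}}$ and the set of F-safe recurrent states $X_{R,F}^{\mathcal{K}}$ under a control policy $\mathcal{K}$ to be: 

$$X_R^{\mathcal{K}} \overset{\mathrm{def}}{=} \{x \in X : P_{\mathcal{K}}(X_k = x \text{ for some } k > 0 \mid X_0 = x) = 1\}$$
$$X_{R,F}^{\mathcal{K}} \overset{\mathrm{def}}{=} \{x \in X_R^{\mathcal{K}} : P_{\mathcal{K}}(X_{k+1} = x^+ \mid X_k = x) = 0, \ x^+ \in F\}$$

The maximal set of F-safe recurrent states is defined as: 

$$X_R^F \overset{\mathrm{def}}{=} \bigcup_{\mathcal{K} \in \mathbb{K}} X_{R,F}^{\mathcal{K}}$$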
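
For a fixed policy, the recurrent set and the F-safe recurrent set defined above can be computed directly from the closed-loop transition matrix. The sketch below is one way to do so under stated assumptions: it uses the standard fact that, in a finite chain, a state is recurrent exactly when every state reachable from it can reach it back. The function names and the two toy matrices are hypothetical.

```python
def reachable(P, x):
    """States reachable from x in one or more steps (edges where P > 0)."""
    seen, stack = set(), [x]
    while stack:
        s = stack.pop()
        for t, p in enumerate(P[s]):
            if p > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def recurrent_states(P):
    """x is recurrent iff every state reachable from x can reach x again."""
    n = len(P)
    reach = [reachable(P, x) for x in range(n)]
    return {x for x in range(n) if all(x in reach[y] for y in reach[x])}


def safe_recurrent_states(P, F):
    """Recurrent states with zero one-step probability of entering F."""
    return {x for x in recurrent_states(P) if all(P[x][xf] == 0 for xf in F)}


# Two hypothetical closed loops on states {0, 1, 2} with F = {2}.
P_a = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
P_b = [[0.25, 0.5, 0.25], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
F = {2}

rec_a, safe_a = recurrent_states(P_a), safe_recurrent_states(P_a, F)
rec_b, safe_b = recurrent_states(P_b), safe_recurrent_states(P_b, F)
```

In `P_b`, state 0 leaks into the forbidden state with positive probability, which makes states 0 and 1 transient and leaves no F-safe recurrent state.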

The problem we address in this paper is defined below: 

Problem 2.1: Given F, determine $X_R^F$ and a control policy $\mathcal{K}^*_R$ such that $X_{R,F}^{\mathcal{K}^*_R} = X_R^F$. 
A solution to Problem 2.1 is provided in Section III, where we also show the existence of $\mathcal{K}^*_R$. 

Remark 2.2: The following is a list of important observations on Problem 2.1: 

• The set $X_R^F$ may contain more than one recurrent class and it will exclude any recurrent class that intersects 
F. 

• There is no $\mathcal{K}$ such that the states in $X \setminus X_R^F$ can be F-safe and recurrent. 

• If the closed loop Markov chain is initialized in $X_R^F$ then the probability that it will ever visit a state in F is 
zero. 
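
For very small chains, the definition of the maximal set can be cross-checked by brute force. The sketch below is not the convex method of this paper: it is a plain enumeration, offered under the assumption that recurrence and one-step safety depend only on which actions receive positive probability in each state, so that it suffices to try every non-empty action subset per state. The 4-state chain and all names are hypothetical, and the cost grows exponentially with the number of states.

```python
from itertools import combinations, product


def reachable(P, x):
    """States reachable from x in one or more steps (edges where P > 0)."""
    seen, stack = set(), [x]
    while stack:
        s = stack.pop()
        for t, p in enumerate(P[s]):
            if p > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def safe_recurrent(P, F):
    """Recurrent states of P that cannot enter F in one step."""
    n = len(P)
    reach = [reachable(P, x) for x in range(n)]
    recurrent = {x for x in range(n) if all(x in reach[y] for y in reach[x])}
    return {x for x in recurrent if all(P[x][xf] == 0 for xf in F)}


def maximal_safe_recurrent(Q, F):
    """Union of F-safe recurrent sets over all policy support patterns."""
    m, n = len(Q), len(Q[0])
    subsets = [c for r in range(1, m + 1) for c in combinations(range(m), r)]
    best = set()
    for choice in product(subsets, repeat=n):  # one action subset per state
        # Uniform policy over the chosen actions; only the support matters.
        P = [[sum(Q[u][x][xn] for u in choice[x]) / len(choice[x])
              for xn in range(n)] for x in range(n)]
        best |= safe_recurrent(P, F)
    return best


# Hypothetical chain: Q[u][x][x_next], forbidden state 3.
Q = [
    [[0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]],          # action 0
    [[0.5, 0, 0, 0.5], [0.5, 0.5, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # action 1
]
X_RF = maximal_safe_recurrent(Q, {3})
```

In this toy chain, state 2 is safe under action 0 but can never be re-entered, so it stays outside the maximal set, in line with the second observation of the remark.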

III. Maximal F-Safe Set of Recurrent States 
We propose a convex program to solve Problem 2.1. Consider the following convex optimization program: 

$$f^*_{XU} = \arg\max_{f_{XU} \in P_{XU}} \mathcal{H}(f_{XU}) \tag{2}$$

subject to: 

$$\sum_{u^+ \in U} f_{XU}(x^+, u^+) = \sum_{x \in X,\, u \in U} Q(x^+, x, u)\, f_{XU}(x, u), \quad x^+ \in X \tag{3}$$
$$\sum_{u \in U} f_{XU}(x, u) = 0, \quad x \in F \tag{4}$$

where $\mathcal{H} : P_{XU} \to \mathbb{R}_{\geq 0}$ is the entropy of $f_{XU}$, and is given by 

$$\mathcal{H}(f_{XU}) = -\sum_{x \in X} \sum_{u \in U} f_{XU}(x, u) \ln(f_{XU}(x, u))$$

where we adopt the standard convention that $0 \ln(0) = 0$. 
The following Theorem provides a solution to Problem 2.1. 
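
The objective and the constraints (3)-(4) are simple to evaluate for a candidate joint pmf. The sketch below does not solve the program; it only checks feasibility and compares entropies, assuming the layouts `f[x][u]` and `Q[u][x][x_next]`. The chain, the candidate pmfs and the function names are hypothetical.

```python
import math


def entropy(f):
    """Joint entropy -sum f ln f, using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for row in f for p in row if p > 0)


def is_feasible(f, Q, F, tol=1e-9):
    """Check that f[x][u] is a pmf satisfying constraints (3) and (4)."""
    n, m = len(f), len(f[0])
    if any(p < -tol for row in f for p in row):
        return False
    if abs(sum(map(sum, f)) - 1.0) > tol:
        return False
    for xn in range(n):  # constraint (3): the state marginal is invariant
        inflow = sum(Q[u][x][xn] * f[x][u] for x in range(n) for u in range(m))
        if abs(sum(f[xn]) - inflow) > tol:
            return False
    # constraint (4): no probability mass on forbidden states
    return all(abs(sum(f[x])) <= tol for x in F)


# Hypothetical chain: Q[u][x][x_next], forbidden state 3.
Q = [
    [[0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]],          # action 0
    [[0.5, 0, 0, 0.5], [0.5, 0.5, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # action 1
]
F = {3}

f_a = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]      # action 0 everywhere
f_b = [[3 / 7, 0.0], [2 / 7, 2 / 7], [0.0, 0.0], [0.0, 0.0]]  # mixes actions in state 1
f_bad = [[0.0, 0.5], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]    # uses the unsafe action

ok_a, ok_b, ok_bad = (is_feasible(f, Q, F) for f in (f_a, f_b, f_bad))
```

Both `f_a` and `f_b` are feasible, and `f_b` has the larger entropy because it spreads mass over more admissible state-action pairs; `f_bad` violates (3) since part of its mass flows into the forbidden state.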

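Theorem 3.1: Let F be given, and assume that (2)-(4) is feasible and that $f^*_{XU}$ is the optimal solution. In addition, 
adopt the marginal pmf $f^*_X(x) = \sum_{u \in U} f^*_{XU}(x, u)$ and consider that $G : U \times X \to [0, 1]$ is any function satisfying 
$\sum_{u \in U} G(u, x) = 1$ for all x in X. The following holds: 

(a) $X_{R,F}^{\mathcal{K}} \subseteq S_{f^*_X}$ for all $\mathcal{K}$ in $\mathbb{K}$. 

(b) $X_R^F = S_{f^*_X}$ 

(c) $X_{R,F}^{\mathcal{K}^*_R} = X_R^F$ for $\mathcal{K}^*_R$ given by: 

$$\mathcal{K}^*_R(u, x) = \begin{cases} \dfrac{f^*_{XU}(x, u)}{f^*_X(x)}, & x \in S_{f^*_X} \\ G(u, x), & \text{otherwise} \end{cases} \qquad (u, x) \in U \times X \tag{5}$$

where we use $S_{f^*_X} = \{x \in X : f^*_X(x) > 0\}$. 
The proof of Theorem 3.1 is given at the end of this section. 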
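
The map in (5) from a joint pmf to a policy is a column-wise normalization on the support of the state marginal. A minimal sketch follows, assuming the layouts `f[x][u]`, `K[u][x]` and `G[u][x]`; the input pmf is a hypothetical feasible point, not a computed optimum, and the function name is illustrative.

```python
def policy_from_joint(f, G):
    """Apply (5): K[u][x] = f[x][u] / f_X(x) on the support, G[u][x] elsewhere."""
    n, m = len(f), len(f[0])
    f_X = [sum(row) for row in f]                      # state marginal
    support = {x for x in range(n) if f_X[x] > 0}      # S_{f_X}
    K = [[f[x][u] / f_X[x] if x in support else G[u][x] for x in range(n)]
         for u in range(m)]
    return K, support


# Hypothetical joint pmf on 4 states and 2 actions, and a uniform fallback G.
f = [[3 / 7, 0.0], [2 / 7, 2 / 7], [0.0, 0.0], [0.0, 0.0]]
G = [[0.5] * 4, [0.5] * 4]

K, support = policy_from_joint(f, G)
```

With output from a numerical solver, a small positive threshold would presumably replace the exact `> 0` test when reading off the support.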

Example 3.2: (Computation of the maximal set of F-safe recurrent states) Suppose X = {1,...,8} and 
U = {1,2}, and consider the controlled Markov chain whose probability transition matrix is $Q^u$. 

[The definition and entries of the matrices $Q^1$ and $Q^2$ are not recoverable from the source.] 

The interconnection diagram of this controlled Markov chain is shown in the accompanying figure. Suppose that state 4 is 
undesirable and should never be visited (in other words, F = {4}). By solving (2)-(4), we find the optimal solution $F^*$, 
where $F^*_{i,j} = f^*_{XU}(i, j)$. 

[Matrix $F^*$ is garbled in the source; the surviving entries are .15, .15, .15, .16, .08, .11 and .2.] 

This implies by Theorem 3.1 that: 

$$X_R^F = \{1, 2, 3, 7, 8\},$$

and the associated control policy is given by the matrix $K^*$, where $K^*_{i,j} = \mathcal{K}^*_R(i, j)$ and $G \in [0, 1]^{2 \times 8}$ is any matrix whose columns sum up to 1. 

[Matrix $K^*$ is garbled in the source; the surviving entries are .58, .42, 1, 1, .5, .5, 1 and 0, together with free entries such as $G_{1,4}$, $G_{1,5}$, $G_{1,6}$ and $G_{2,6}$ in the columns of states outside $X_R^F$.] 

It is important to highlight some interesting points that would otherwise not be clear if we were to consider a 
large system. Note that state 6 is not in $X_R^F$ because regardless of which control action is chosen the probability of 
transitioning to F is always positive. State 5 cannot be made recurrent even though it is a safe state. Furthermore, 
when the chain visits states 1 and 8, one of the two available control actions cannot be chosen since that choice 
leads to a positive probability of reaching F. In this scenario there are two safe recurrent classes: {1,2,3} and {7,8}. 
Note that the control action 1 cannot be chosen when the chain visits state 3 because that choice makes the states 
1, 2, and 3 transient. 

Remark 3.3: Consider a control policy $\mathcal{K}$ for which $\mathcal{K}(u, x) > 0$ if and only if $\mathcal{K}^*_R(u, x) > 0$, where $\mathcal{K}^*_R$ is an 
optimal solution given by (5). It holds that $X_{R,F}^{\mathcal{K}} = X_{R,F}^{\mathcal{K}^*_R} = X_R^F$. 

Remark 3.4: Consider a control policy $\mathcal{K}$ for which $X_{R,F}^{\mathcal{K}} = X_R^F$. It holds that $\mathcal{K}(u, x) = 0$ if $\mathcal{K}^*_R(u, x) = 0$. 

A. Proof of Theorem 3.1 

To facilitate the proof we first establish the following Lemma: 

Lemma 3.5: Let Y be a finite set and W be a convex subset of $P_Y$, the set of all pmfs with support in Y. Consider 
the following problem: 

$$f^* = \arg\max_{f \in W} \mathcal{H}(f)$$

where $\mathcal{H}(f)$ is the entropy of f in $P_Y$ and is given by $\mathcal{H}(f) = -\sum_{y \in Y} f(y) \ln(f(y))$, where we adopt the 
convention that $0 \ln(0) = 0$. The following holds: 

$$S_f \subseteq S_{f^*}, \quad f \in W$$

where $S_f = \{y \in Y \mid f(y) > 0\}$, $f \in P_Y$. 

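Proof: Select an arbitrary f in W and define $f_\lambda = \lambda f^* + (1 - \lambda) f$ for $0 \leq \lambda \leq 1$. From the convexity of W, 
we conclude that $f_\lambda$ is in W for all $\lambda$ in [0, 1]. Since $f^*$ has maximal entropy, it must be that there exists a $\bar{\lambda}$ in 
[0, 1) such that 

$$\frac{d}{d\lambda} \mathcal{H}(f_\lambda) \geq 0, \quad \lambda \in (\bar{\lambda}, 1). \tag{6}$$

Proof by contradiction: Suppose that $S_f \not\subseteq S_{f^*}$ and hence that there exists a $y'$ in Y such that $f(y') > 0$ and 
$f^*(y') = 0$. We have that 

$$\frac{d}{d\lambda} \Big( f_\lambda(y') \ln(f_\lambda(y')) \Big) = -f(y') \big( \ln(f_\lambda(y')) + 1 \big)$$

goes to $\infty$ as $\lambda$ approaches 1, since $\lim_{\lambda \to 1} f_\lambda(y') = 0$. This implies that there exists a $\bar{\lambda}$ in [0, 1) such that 

$$\frac{d}{d\lambda} \mathcal{H}(f_\lambda) < 0, \quad \lambda \in (\bar{\lambda}, 1),$$

which contradicts (6). ■ 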
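
The derivative used in the contradiction step can be written out explicitly. The following worked step relies only on the definitions in the proof, with $y'$ the element satisfying $f(y') > 0$ and $f^*(y') = 0$:

```latex
f_\lambda(y') = \lambda f^*(y') + (1-\lambda) f(y') = (1-\lambda)\, f(y'),
\qquad
\frac{d}{d\lambda} f_\lambda(y') = -f(y'),
\\[4pt]
\frac{d}{d\lambda}\Big( f_\lambda(y') \ln f_\lambda(y') \Big)
  = \frac{d f_\lambda(y')}{d\lambda}\,\big( \ln f_\lambda(y') + 1 \big)
  = -f(y')\,\Big( \ln\big( (1-\lambda) f(y') \big) + 1 \Big)
  \xrightarrow[\lambda \to 1]{} +\infty .
```

Since the entropy carries a minus sign in front of this term, its contribution to the derivative of the entropy tends to $-\infty$, while every term with $f^*(y) > 0$ has a bounded derivative near $\lambda = 1$. This is what forces the derivative of the entropy to become negative close to $\lambda = 1$.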
See ([8 1) for an alternative proof that relies on the concept of relative entropy. 
Proof of Theorem 3.1: 

(a) (Proof that $X_{R,F}^{\mathcal{K}} \subseteq S_{f^*_X}$ holds for all $\mathcal{K}$ in $\mathbb{K}$.) Select an arbitrary control policy $\mathcal{K}$ in $\mathbb{K}$. There are two possible 
cases: i) When $X_{R,F}^{\mathcal{K}}$ is the empty set, the statement follows trivially. ii) If $X_{R,F}^{\mathcal{K}}$ is non-empty then the closed 
loop must have an invariant pmf $f^{\mathcal{K}}_{XU}$ that satisfies the following: 

$$f^{\mathcal{K}}_{XU}(x^+, u^+) = \mathcal{K}(u^+, x^+) \sum_{x \in X,\, u \in U} Q(x^+, x, u)\, f^{\mathcal{K}}_{XU}(x, u), \tag{7}$$
$$\sum_{u \in U} f^{\mathcal{K}}_{XU}(x, u) > 0, \quad x \in X_{R,F}^{\mathcal{K}}; \quad \text{(recurrence)} \tag{8}$$
$$\sum_{u \in U} f^{\mathcal{K}}_{XU}(x, u) = 0, \quad x \in F. \quad \text{(F-safety)} \tag{9}$$

Equation (7) follows from the fact that $f^{\mathcal{K}}_{XU}$ is an invariant pmf of the closed loop, while (8)-(9) follow from 
the definition of $X_{R,F}^{\mathcal{K}}$. 

Our strategy to conclude the proof (that $X_{R,F}^{\mathcal{K}} \subseteq S_{f^*_X}$ holds) is to show that $f^{\mathcal{K}}_{XU}$ is a feasible solution of the 
convex program (2)-(4), after which we can use Lemma 3.5. In order to show the aforementioned feasibility, 
note that summing both sides of equation (7) over the set of control actions yields (3), where we use the 
fact that $\sum_{u^+ \in U} \mathcal{K}(u^+, x^+) = 1$. Moreover, the constraint (4) and the F-safety equality in (9) are identical. 
Therefore, $f^{\mathcal{K}}_{XU}$ is a feasible pmf to (2)-(4). By Lemma 3.5, it follows that $S_{f^{\mathcal{K}}_{XU}} \subseteq S_{f^*_{XU}}$ and, consequently, that 
$X_{R,F}^{\mathcal{K}} \subseteq S_{f^*_X}$. 

(b) (Proof that $X_R^F = S_{f^*_X}$ holds) It follows from (a) that $X_R^F \subseteq S_{f^*_X}$. To prove that $S_{f^*_X} \subseteq X_R^F$, select an optimal 
policy $\mathcal{K}^*_R$ as in (5), and note that the corresponding closed loop pmf $f^*_{XU}$ is an invariant distribution, leading 
to: 

$$f^*_{XU}(x^+, u^+) = \mathcal{K}^*_R(u^+, x^+) \sum_{x \in X,\, u \in U} Q(x^+, x, u)\, f^*_{XU}(x, u).$$

Consider any element x in $S_{f^*_X}$. Since $\sum_{u \in U} f^*_{XU}(x, u) > 0$ holds and from the fact that $f^*_{XU}$ is an invariant 
distribution of the closed loop, we conclude that the following must be satisfied: 

$$P_{\mathcal{K}^*_R}(X_k = x \text{ for some } k > 0 \mid X_0 = x) = 1.$$

This means that x belongs to $X_R^{\mathcal{K}^*_R}$. From (4) it is clear that x is an F-safe state and, thus, belongs to $X_{R,F}^{\mathcal{K}^*_R}$. 
Hence, by definition, x belongs to $X_R^F$. Since the choice of x in $S_{f^*_X}$ was arbitrary, we conclude that $S_{f^*_X} \subseteq X_R^F$. 

(c) (Proof that $X_{R,F}^{\mathcal{K}^*_R} = X_R^F$ holds) Follows from the proof of (b). 



IV. Further Considerations 
Computational complexity reduction. Consider the following convex program: 

$$\tilde{f}_{XU} = \arg\max_{f_{XU} \in P_{XU}} \mathcal{H}\Big( \sum_{u \in U} f_{XU}(\cdot, u) \Big), \quad \text{s.t. (3) and (4)},$$

where the objective function has been modified to be the entropy of the marginal pmf with respect to the state (rather 
than the joint entropy as in (2)). A simple modification of Theorem 3.1 leads to the conclusion that $S_{\tilde{f}_X} = S_{f^*_X}$. 
Therefore, the modified program also provides a solution to Problem 2.1, with the advantage that it requires fewer 
calls to the entropy function, thus reducing computational complexity. However, the optimal control policy $\tilde{\mathcal{K}}_R$ 
(obtained in an analogous manner as in (5)) may differ from $\mathcal{K}^*_R$. The most significant difference is that Remark 
3.4 does not apply to $\tilde{\mathcal{K}}_R$. 
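
One way to see why the two programs can return different policies is the chain rule for entropy, a standard identity that is not stated in the paper. Writing $\mathcal{K}(u, x) = f_{XU}(x, u) / f_X(x)$ on the support of $f_X$:

```latex
\mathcal{H}(f_{XU})
  = \underbrace{-\sum_{x \in X} f_X(x) \ln f_X(x)}_{\mathcal{H}(f_X)}
  \;+\; \sum_{x \in S_{f_X}} f_X(x)
        \Big( -\sum_{u \in U} \mathcal{K}(u, x) \ln \mathcal{K}(u, x) \Big).
```

The modified objective keeps only the first term, so nothing in it rewards spreading probability across the admissible actions of a given state. This is consistent with the observation that the resulting policy need not use every action that the joint-entropy solution uses.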
Additional constraints. Further convex constraints on $f_{XU}$ may be incorporated in the optimization problem (2)-(4) 
without affecting its tractability. For instance, consider constraints of the following type: 

$$\sum_{x \in X,\, u \in U} h(u)\, f_{XU}(x, u) \leq \beta, \tag{10}$$

where h is an arbitrary function and $\beta$ an arbitrary real number. Let $f^*_{XU}$ be an optimal solution to (2)-(4) and 
(10). If the Markov chain is initialized with $f^*_{XU}$ (an invariant distribution), the following holds for each k: 

$$E[h(U_k)] \leq \beta.$$

Moreover, when $X_R^F$ contains only one aperiodic recurrent class, the following holds for any initial distribution: 

$$\lim_{k \to \infty} E[h(U_k)] \leq \beta.$$
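
The claim that the expected cost stays constant when the chain starts from an invariant joint pmf can be checked numerically. The sketch below assumes the layouts `f[x][u]`, `Q[u][x][x_next]` and `K[u][x]`; the chain, the cost `h`, the bound `beta` and the function names are hypothetical, and `f` is a feasible invariant pmf rather than a computed optimum.

```python
def expected_cost(f, h):
    """E[h(U)] under the joint pmf f[x][u], i.e. the left-hand side of (10)."""
    return sum(h[u] * f[x][u] for x in range(len(f)) for u in range(len(h)))


def step(f, Q, K):
    """One closed-loop step: f'(x+, u+) = K(u+, x+) * sum_{x,u} Q(x+, x, u) f(x, u)."""
    n, m = len(f), len(f[0])
    inflow = [sum(Q[u][x][xn] * f[x][u] for x in range(n) for u in range(m))
              for xn in range(n)]
    return [[K[u][xn] * inflow[xn] for u in range(m)] for xn in range(n)]


# Hypothetical chain: Q[u][x][x_next], forbidden state 3.
Q = [
    [[0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]],          # action 0
    [[0.5, 0, 0, 0.5], [0.5, 0.5, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # action 1
]
f = [[3 / 7, 0.0], [2 / 7, 2 / 7], [0.0, 0.0], [0.0, 0.0]]  # invariant joint pmf
K = [[1.0, 0.5, 0.5, 0.5], [0.0, 0.5, 0.5, 0.5]]            # policy induced by f
h = [0.0, 1.0]   # cost 1 for using action 1
beta = 0.3

cost_now = expected_cost(f, h)
cost_next = expected_cost(step(f, Q, K), h)
```

Because `f` is invariant under the closed loop, `step` returns the same pmf and the expected cost is unchanged from one time step to the next, staying below `beta`.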

V. Conclusion 

This paper addresses the design of full-state feedback policies for controlled Markov chains defined on finite 
alphabets. The main problem is to design policies that lead to the largest set of recurrent states for which the 
probability of transitioning to a pre-selected set of forbidden states is zero. The paper describes a finitely parametrized 
convex program that solves the problem via entropy maximization principles. 